In this document I outline the considerations involved in designing an experiment which tests whether users, in an exploratory visual data analysis scenario, perform multiple comparisons corrections when they are comparing multiple hypotheses. A typical null hypothesis significance test is performed at \(\alpha = 0.05\), which implies that, when the null is true, one in every 20 tests using this procedure will yield a false discovery. However, in a multiple comparisons procedure, the probability of finding at least one false positive increases as the number of comparisons increases.
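To make this growth concrete, the familywise false-positive probability \(1 - (1 - \alpha)^m\) can be tabulated directly; a quick sketch with purely illustrative values of m:

```r
# Probability of at least one false positive among m independent tests
# at level alpha, when every null is true: 1 - (1 - alpha)^m
m_demo <- c(1, 5, 10, 20, 50)
fwer_demo <- 1 - (1 - 0.05)^m_demo
round(fwer_demo, 3)
```

At just 20 comparisons, the chance of at least one false discovery already exceeds 60%.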
In the following simulations, we will sample p-values directly, and then generate data compatible with each sampled p-value. An alternative approach would have been to simulate data and then calculate a p-value (for a particular null hypothesis). However, this is time-consuming, as we would have to simulate random data and calculate p-values for every stimulus; it also reduces control over the experimental stimuli, because the inherent randomness in generating data can result in a set of stimuli very different from what we desire—for example, it is possible to get data where Pr(FP | stimuli) is zero, because all the p-values estimated from the data are greater than \(\alpha\), which would reduce the power of our experiment.
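For contrast, the data-first route we are rejecting looks like this in miniature (a sketch, not part of the experimental pipeline): simulate null data, run a test on each dataset, and collect the resulting p-values.

```r
# Data-first route: simulate null data, then compute a p-value per dataset.
# Under a true null these p-values are uniform, so about alpha of them
# fall below alpha -- but any particular batch can drift from that.
set.seed(1)
pvals <- replicate(2000, t.test(rnorm(30))$p.value)
mean(pvals < 0.05)
```

Sampling p-values directly sidesteps this batch-to-batch drift, which is exactly the control over stimuli we want.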
To sample p-values directly, we need to know the distribution of the p-value:
When the null hypothesis is true: the distribution of the p-value is uniform on [0, 1].
When the alternative hypothesis is true: the distribution of the p-value is given by @Hung et al. The functions below implement this distribution and allow us to simulate p-values under the alternative; in particular, rp draws a random sample under the assumption that the alternative hypothesis is true.
# PDF
f_p <- function(x, mu, sigma, K) {
  dnorm(qnorm(1 - x) - sqrt(K) * mu / sigma) / dnorm(qnorm(1 - x))
}

# CDF
F_p <- function(x, mu, sigma, K) {
  1 - pnorm(qnorm(1 - x) - sqrt(K) * mu / sigma)
}

# inverse CDF of the p-value
F_p_inv <- function(q, mu, sigma, K, l = 0, u = 1) {
  uniroot(function(p) F_p(p, mu, sigma, K) - q, lower = l, upper = u)$root
}

# random sample from the PDF of the p-value using its inverse CDF
rp <- function(mu, sigma, K) {
  q <- runif(1)
  F_p_inv(q, mu, sigma, K)
}
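As a sanity check on the inverse-transform logic, applying the inverse CDF to the CDF should recover the original p-value. The functions are restated here (with `_check` suffixes so as not to shadow the definitions above) and the mu, sigma, K values are arbitrary:

```r
# F_p and its inverse, restated so this check is self-contained
F_p_check <- function(x, mu, sigma, K) 1 - pnorm(qnorm(1 - x) - sqrt(K) * mu / sigma)
F_p_inv_check <- function(q, mu, sigma, K) {
  uniroot(function(p) F_p_check(p, mu, sigma, K) - q, lower = 0, upper = 1)$root
}
# round-trip: invert the CDF of p = 0.2; should recover ~0.2
F_p_inv_check(F_p_check(0.2, mu = 3, sigma = 10, K = 5), mu = 3, sigma = 10, K = 5)
```

The recovered value agrees with the input up to uniroot's default tolerance.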
Next, let us verify the distribution of p-values obtained using our simulation. When the null is true, the distribution should be uniform:
mu = 3
sd = 10
m = c(1, 4, 8, 12, 20, 30, 50)
ntrials = 1000
p_null = 0.7
alpha = 0.05
set.seed(1234)
data.sim.pvals = crossing(m, p0 = p_null, alpha, trial = 1:ntrials) |>
  mutate(
    `mu > 0` = map2(m, p0, ~ sample(c(rep(0, round(.y * .x)), rep(1, round((1 - .y) * .x))), .x))
  ) |>
  unnest(c(`mu > 0`)) |>
  mutate(p_uncorrected = map2_dbl(`mu > 0`, m, ~ ifelse(.x == 0, runif(1, 0, 1), rp(mu, sd, .y))))
data.sim.pvals |>
  ungroup() |>
  filter(`mu > 0` == 0) |>
  ggplot() +
  geom_histogram(aes(p_uncorrected), boundary = 0, colour = "#ffffff", binwidth = 0.05) +
  scale_x_continuous(breaks = seq(0, 1, by = 0.2))
Let’s also look at the distribution of p-values when the alternative hypothesis is true:
data.sim.pvals |>
  ungroup() |>
  filter(`mu > 0` == 1) |>
  ggplot() +
  geom_histogram(aes(p_uncorrected), boundary = 0, colour = "#ffffff", binwidth = 0.05) +
  scale_x_continuous(breaks = seq(0, 1, by = 0.2))
To understand how the probability of finding at least one false positive changes, let us look at a few different scenarios. We set \(\mu = 3\), \(\sigma = 10\), and vary \(m\), where m indicates the number of comparisons we are making.
First, we look at the probability of finding at least one false positive, given that the null hypothesis is always true. We first simulate the data, and then plot it. We also plot the theoretical estimate \(1 - (1 - \alpha)^{\pi_0 m}\) to ensure our calculations are correct:
set.seed(1234)
(p1 = crossing(m, p0 = 1, alpha = 0.05, trial = 1:ntrials) |>
  mutate(
    `mu > 0` = map2(m, p0, ~ sample(c(rep(0, round(.y * .x)), rep(1, round((1 - .y) * .x))), .x))
  ) |>
  unnest(c(`mu > 0`)) |>
  mutate(p = map2_dbl(`mu > 0`, m, ~ ifelse(.x == 0, runif(1, 0, 1), rp(3, 10, .y)))) |>
  mutate(reject_null = p < alpha) |> # reject or not?
  group_by(m, p0, trial) |>
  mutate(TP = as.integer(`mu > 0` & reject_null), # correctness of the decision
         FP = as.integer(!`mu > 0` & reject_null),
         TN = as.integer(!`mu > 0` & !reject_null),
         FN = as.integer(`mu > 0` & !reject_null)) |>
  summarise_at(.vars = vars(TP, FP, TN, FN), sum) |> # total number of TP, FP, TN, FN for each trial
  # because \pi_0 = 1 here, FDR is 1 whenever anything is rejected,
  # thus correcting for FDR is the same as correcting for FWER
  summarise(FWER = mean(FP > 0), .groups = "drop") |> # estimates Pr(FP > 0) across the set of trials
  mutate(`theoretical estimate` = 1 - 0.95^m) |>
  pivot_longer(cols = c(FWER, `theoretical estimate`)) |>
  ggplot() +
  geom_point(aes(y = value, x = m, color = name)) +
  geom_line(aes(y = value, x = m, color = name)) +
  scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, by = 0.2)) +
  labs(x = "m (number of comparisons)", y = "Probability of at least\none False Positive"))
However, a scenario in which the probability of the null being true equals one is unrealistic; lowering it changes the probability of finding at least one false positive:
p2 = data.sim.pvals |>
  mutate(reject_null = p_uncorrected < alpha) |>
  group_by(m, p0, trial) |>
  mutate(TP = as.integer(`mu > 0` & reject_null),
         FP = as.integer(!`mu > 0` & reject_null),
         TN = as.integer(!`mu > 0` & !reject_null),
         FN = as.integer(`mu > 0` & !reject_null)) |>
  summarise_at(.vars = vars(TP, FP, TN, FN), sum) |>
  summarise(FWER = mean(FP > 0), .groups = "drop") |>
  mutate(`theoretical estimate` = 1 - 0.95^(m * p0)) |>
  pivot_longer(cols = c(FWER, `theoretical estimate`)) |>
  ggplot() +
  geom_point(aes(y = value, x = m, color = name)) +
  geom_line(aes(y = value, x = m, color = name)) +
  scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, by = 0.2)) +
  labs(x = "m (number of comparisons)", y = "Probability of at least\none False Positive")

legend = get_legend(
  # create some space to the left of the legend
  p2 + theme(legend.box.margin = margin(0, 0, 0, 1))
)

plot_grid(
  p1 + ggtitle(bquote({Pr(H[0]) == 1})) + theme(legend.position = "none"),
  p2 + ggtitle(bquote({Pr(H[0]) == 0.7})) + theme(legend.position = "none"),
  legend,
  rel_widths = c(4, 4, 1),
  nrow = 1
)
There exist multiple comparisons correction procedures which allow an analyst to control the False Discovery Rate (FDR) or the Family-Wise Error Rate (FWER) at a specific \(\alpha\) level. Below, we simulate p-values given \(Pr(H_0) = 0.7\) and, using \(\alpha = 0.05\), compare the FDR and FWER for three strategies:
set.seed(1234)
data.m_compare.pvals = crossing(m, p0 = 0.7, alpha = 0.05, trial = 1:ntrials) |>
  mutate(
    `mu > 0` = map2(m, p0, ~ sample(c(rep(0, round(.y * .x)), rep(1, round((1 - .y) * .x))), .x))
  ) |>
  unnest(c(`mu > 0`)) |>
  mutate(p_uncorrected = map2_dbl(`mu > 0`, m, ~ ifelse(.x == 0, runif(1, 0, 1), rp(3, 10, .y)))) |>
  group_by(m, p0, alpha, trial) |> # adjust within each trial's family of m p-values
  mutate(
    p_BH = p.adjust(p_uncorrected, "BH"),
    p_bonferroni = p.adjust(p_uncorrected, "bonferroni")
  ) |>
  pivot_longer(cols = starts_with("p_"), names_prefix = "p_", names_to = "method", values_to = "p_val") |>
  group_by(method, m, p0, alpha, trial) |>
  mutate(reject_null = p_val < alpha,
         TP = as.integer(`mu > 0` & reject_null),
         FP = as.integer(!`mu > 0` & reject_null),
         TN = as.integer(!`mu > 0` & !reject_null),
         FN = as.integer(`mu > 0` & !reject_null)) |>
  summarise_at(.vars = vars(TP, FP, TN, FN), sum)
Next, we visualise the FDR and FWER for each of the three strategies:
p1 = data.m_compare.pvals |>
  summarise(FWER = mean(FP > 0), .groups = "drop") |>
  ggplot() +
  geom_hline(yintercept = 0.05, linewidth = 0.5) +
  geom_point(aes(y = FWER, x = m, color = method)) +
  geom_line(aes(y = FWER, x = m, color = method)) +
  scale_y_continuous(breaks = seq(0, 1, by = 0.1))

p2 = data.m_compare.pvals |>
  mutate(FDR = ifelse((FP + TP) > 0, FP / (FP + TP), 0)) |>
  summarise(FDR = mean(FDR), .groups = "drop") |>
  ggplot() +
  geom_hline(yintercept = 0.05, linewidth = 0.5) +
  geom_point(aes(y = FDR, x = m, color = method)) +
  geom_line(aes(y = FDR, x = m, color = method))

legend = get_legend(p2 + theme(legend.box.margin = margin(0, 0, 0, 1)))

plot_grid(p1 + theme(legend.position = "none"), p2 + theme(legend.position = "none"), legend, rel_widths = c(4, 4, 1), nrow = 1)
From the plots above, we see that Bonferroni controls the Family-Wise Error Rate, while Benjamini-Hochberg controls the False Discovery Rate.
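The behaviour of the two corrections can also be seen directly on a small hand-picked vector of p-values (illustrative values only):

```r
pv <- c(0.001, 0.012, 0.03, 0.2, 0.7)
p.adjust(pv, "bonferroni")  # each p-value multiplied by m = 5, capped at 1
p.adjust(pv, "BH")          # step-up: p * m / rank(p), then made monotone
```

Bonferroni inflates every p-value by the full family size, whereas BH scales each by m over its rank, so moderate p-values survive the correction far more often.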
However, in scenarios where the number of comparisons being performed is small (for example, m ≤ 10), we may not see a sufficiently large FDR, as the number of instances with \(p < 0.05\) may itself be small. We refer to these as low-probability binomial events: binomial random variables with small values of p (i.e., close to zero). Consider a scenario where an event occurs with probability p. In a single trial, a participant attempts the event m times (which we vary). Over a set of 100 trials, we measure the number of successful events via a score, and explore its variance:
m = 2^c(1:8)
trials = c(20, 30, 40, 50, 100)
prob = c(0.05, 0.1, 0.2, 0.25, 0.3, 0.4, 0.5)
n = 100
set.seed(100)
simulations = crossing(m, trials, prob, n = 1:n) |>
  mutate(
    event = pmap(list(m, trials, prob), ~ rbinom(..2, ..1, ..3)),
    score = map_dbl(event, ~ mean(.)) / m,
    trials = factor(trials)
  )
simulations |>
  group_by(m, trials, prob) |>
  summarise(score = list(score), .groups = "keep") |>
  median_qi(score) |>
  ggplot() +
  geom_pointinterval(aes(x = m, y = score, ymin = .lower, ymax = .upper), position = "dodge") +
  geom_hline(aes(yintercept = prob)) +
  geom_hline(aes(yintercept = 0), linewidth = 0.1, colour = "#efefef") +
  geom_hline(aes(yintercept = 2 * prob), linewidth = 0.1, colour = "#efefef") +
  scale_x_log10(breaks = m) +
  facet_grid(prob ~ trials, scales = "free") +
  theme_minimal() +
  theme(panel.spacing = unit(5, "mm"))
Based on the graphs above, the score is centred on p, which is as expected, though with substantial variance when m, p, and the number of trials are all small. Another way of visualising the same dataset:
simulations |>
  mutate(n = factor(n)) |>
  ggplot(aes(y = score, x = m, group = n)) +
  geom_line(alpha = 0.1) +
  geom_hline(aes(yintercept = prob)) +
  geom_hline(aes(yintercept = 0), linewidth = 0.1, colour = "#efefef") +
  geom_hline(aes(yintercept = 2 * prob), linewidth = 0.1, colour = "#efefef") +
  scale_x_log10(breaks = m) +
  facet_grid(prob ~ trials, scales = "free") +
  theme_minimal() +
  theme(panel.spacing = unit(5, "mm"))
This has implications for our study design for testing the multiple comparisons problem in exploratory visual data analysis. Consider NHST without multiple comparisons correction, performed at a particular \(\alpha\) level. The number of comparisons is given by m, and the \(\alpha\) level plays the role of p. Assuming that the null is always true, the probability of a False Positive on each comparison is thus p. If we ask participants to perform NHST, we would therefore observe similar variance in the average number of False Positives.
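Under this framing, the spread of the false-positive count follows directly from the binomial distribution. A quick tabulation for the candidate values of m below, assuming all nulls are true and the conventional \(\alpha = 0.05\):

```r
# FP ~ Binomial(m, alpha) when every null is true:
# mean m * alpha, spread sqrt(m * alpha * (1 - alpha))
m_vals <- c(8, 12, 16, 20, 24)
a <- 0.05
data.frame(m = m_vals,
           mean_FP = m_vals * a,
           sd_FP   = round(sqrt(m_vals * a * (1 - a)), 2))
```

At \(\alpha = 0.05\) the standard deviation of the FP count is comparable to (or larger than) its mean, which is exactly the measurement problem described above.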
Thus, to reliably measure False Positives, we want to set the number of potential False Positives sufficiently high. We can do so by using higher values for \(\alpha\) and \(m\), s.t. \(\alpha \in \{0.2, 0.25, 0.3\}\) and \(m \in \{8, 12, 16, 20, 24\}\). We do not explore values of m greater than 24 because it is not possible to show that many graphs on a screen.
Constraints: as described previously, we want to set the number of potential False Positives sufficiently high; in other words, we want \(\pi_0 \in \{0.5, 0.6, 0.7, 0.8, 0.9\}\). Note that at the lower end of this combination of values (for example, \(\pi_0 = 0.5\) and \(m = 8\)), we should still expect to see a lot of variance, as the maximum number of FP is 4 and the expected number of FP is ~1, depending on the value of \(\alpha\).
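The arithmetic behind that remark, for the lowest corner of the design space:

```r
# pi_0 = 0.5, m = 8: four true-null comparisons, so at most 4 false positives;
# the expected count is n_null * alpha at each candidate alpha level
n_null <- 0.5 * 8
n_null * c(0.2, 0.25, 0.3)  # expected FP: 0.8, 1.0, 1.2
```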
Here, I define some variables to help create nicer looking legends for the subsequent plots:
template = tibble(
  m = sample(c(12, 16, 20), 10, replace = TRUE),
  method = sample(c("BH", "bonferroni", "uncorrected", "reject_none"), 10, replace = TRUE),
  FDR = runif(10)
) |>
  mutate(method = factor(method, levels = c("bonferroni", "BH", "uncorrected", "reject_none")))

# some nicer looking legends
legend = get_legend(
  ggplot(template, aes(x = m, y = FDR, colour = method)) +
    geom_point() +
    guides(color = guide_legend(override.aes = list(size = 3)))
)
There are several variables which may impact the number of correct decisions. Below, we explore how varying each of these impacts the number of correct decisions as well as the payoff.
In the following simulation, we will specify different values of \(\pi_0\) to see how they impact participants' False Discovery Rates and payoffs, based on an incentive structure of {TP: 1, TN: 1, FP: \(-(1/\alpha - 1)\), FN: -1}, where \(\alpha \in \{0.2, 0.25, 0.3\}\).
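Under this incentive structure, the cost of a false positive scales with \(1/\alpha\); concretely:

```r
# the FP penalty (1/alpha - 1) at each candidate alpha level
a_vals <- c(0.2, 0.25, 0.3)
1 / a_vals - 1  # a false positive costs 4x, 3x, ~2.33x the reward of a true positive
```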
First, for each unique combination of \(m\), \(Pr(H_0)\), and \(\alpha\), we simulate \(m\) p-values 1000 times:
mean = 3
sd = 10
m = c(10, 20, 30, 40, 50, 100)
ntrials = 1000
p_null = seq(0.5, 0.9, by = 0.1)
alpha = c(0.2, 0.25, 0.3)
set.seed(1234)
data.sim.pvals = crossing(m, p0 = p_null, alpha, trial = 1:ntrials) |>
  mutate(
    `mu > 0` = map2(m, p0, ~ sample(c(rep(0, round(.y * .x)), rep(1, round((1 - .y) * .x))), .x))
  ) |>
  unnest(c(`mu > 0`)) |>
  mutate(p_uncorrected = map2_dbl(`mu > 0`, m, ~ ifelse(.x == 0, runif(1, 0, 1), rp(mean, sd, .y))))
# saveRDS(data.sim.pvals, "../data/simulations/data.sim.pvals.rds")
Next, we calculate the corrected p-values, and then estimate the number of TP/TN/FP/FN:
data.sim.pvals = readRDS("../data/simulations/data.sim.pvals.rds")
data.sim.conf_matrix = data.sim.pvals |>
  group_by(m, p0, alpha, trial) |>
  mutate(
    p_BH = p.adjust(p_uncorrected, "BH"),
    p_bonferroni = p.adjust(p_uncorrected, "bonferroni"),
    p_reject_none = 1,
    p_reject_all = 0
  ) |>
  pivot_longer(cols = starts_with("p_"), names_prefix = "p_", names_to = "method", values_to = "p_val") |>
  group_by(method, m, p0, alpha, trial) |>
  mutate(
    reject_null = p_val < alpha,
    TP = as.integer(`mu > 0` & reject_null), # correctness of the decision
    FP = as.integer(!`mu > 0` & reject_null),
    TN = as.integer(!`mu > 0` & !reject_null),
    FN = as.integer(`mu > 0` & !reject_null)
  ) |>
  summarise_at(.vars = vars(TP, FP, TN, FN), sum)
saveRDS(data.sim.conf_matrix, "../data/simulations/data.sim.conf_matrix.rds")
Since the above steps can be quite time-consuming, we store the simulated data frame as an RDS file and load it back. The plot below shows the False Discovery Rates for the Benjamini-Hochberg, Bonferroni, and uncorrected strategies. We omit the others (reject all and reject none), as these would lead to very high False Positives and zero False Positives, respectively:
data.sim.conf_matrix = readRDS("../data/simulations/data.sim.conf_matrix.rds")
data.qi.fdr = data.sim.conf_matrix |>
  filter(method != "reject_all") |>
  mutate(
    FDR = ifelse(TP + FP > 0, FP / (TP + FP), 0),
    payoff = TP + TN - FN - (1/alpha - 1) * FP
  ) |>
  mean_qi(FDR) |>
  mutate(method = fct_relevel(method, c("bonferroni", "BH", "uncorrected", "reject_none")))

p.fdr = data.qi.fdr |>
  filter(method != "reject_all" & method != "reject_none") |>
  ggplot() +
  geom_pointinterval(
    aes(x = alpha, y = FDR, ymin = .lower, ymax = .upper, colour = method),
    linewidth = 1, fatten_point = 1.2, alpha = 0.7, position = position_dodge(width = 0.02)
  ) +
  geom_line(aes(alpha, FDR, colour = method), position = position_dodge(width = 0.01)) +
  facet_grid(m ~ p0, scales = "free_y") +
  scale_x_continuous(breaks = seq(0.2, 0.3, by = 0.05)) +
  theme(panel.spacing = unit(5, "mm"), legend.position = "none")

plot_grid(p.fdr, legend, rel_widths = c(9, 1), nrow = 1)
From the plot above, we see that the differences in False Discovery Rates are most easily distinguishable when \(Pr(H_0) \geq 0.7\) and \(m \geq 20\), and the differences become larger as \(Pr(H_0)\) and \(m\) increase. Below, we show the payoffs obtained using the following incentive structure: {TP: 1, TN: 1, FP: \(-(1/\alpha - 1)\), FN: -1}. We again omit reject all, as it leads to incredibly low payoffs due to the high penalty for False Discoveries.
p.payoff = data.sim.conf_matrix |>
  mutate(
    FDR = ifelse(TP + FP > 0, FP / (TP + FP), 0),
    payoff = TP + TN - FN - (1/alpha - 1) * FP
  ) |>
  mean_qi(payoff) |>
  mutate(method = fct_relevel(method, c("bonferroni", "BH", "uncorrected", "reject_none", "reject_all"))) |>
  filter(method != "reject_all") |>
  ggplot() +
  geom_pointinterval(
    aes(x = alpha, y = payoff, ymin = .lower, ymax = .upper, colour = method),
    linewidth = 1, fatten_point = 1.2, alpha = 0.7, position = position_dodge(width = 0.02)
  ) +
  geom_line(aes(alpha, payoff, colour = method), position = position_dodge(width = 0.02)) +
  facet_grid(m ~ p0, scales = "free_y") +
  scale_x_continuous(breaks = seq(0.2, 0.3, by = 0.05)) +
  theme(panel.spacing = unit(5, "mm"), legend.position = "none")

plot_grid(p.payoff, legend, rel_widths = c(9, 1), nrow = 1)
The graphs above suggest that, if we want to observe a difference between the use of a multiple comparisons correction strategy such as BH and an uncorrected strategy, we need to ask people to perform a large number of comparisons (\(m \in [20, 30]\)) at \(\pi_0 \geq 0.7\) and \(\alpha = 0.25\).
One thing we did not vary in our previous simulation was the relative value of the mean and standard deviation (i.e., the effect size). This plays a role in the number of TP and FN, as it controls the distribution of the p-value when the alternative hypothesis is true. Next, we examine what happens when the effect size varies. To impose some constraints on the number of variables, we fix \(\pi_0 \in \{0.7, 0.8\}\), \(m = 20\), and \(\alpha = 0.25\):
mu = c(1, 2, 3, 4, 5)
sd = 10
m = 20
ntrials = 1000
p_null = seq(0.7, 0.8, by = 0.1)
alpha = 0.25
set.seed(1234)
data.sim.effect_size_varying.pvals = crossing(m, p0 = p_null, alpha, mu, sd, trial = 1:ntrials) |>
  mutate(
    `mu > 0` = map2(m, p0, ~ sample(c(rep(0, round(.y * .x)), rep(1, round((1 - .y) * .x))), .x)),
    effect_size = mu / sd
  ) |>
  unnest(c(`mu > 0`)) |>
  mutate(p_uncorrected = pmap_dbl(list(`mu > 0`, m, mu, sd), ~ ifelse(..1 == 0, runif(1, 0, 1), rp(..3, ..4, ..2))))
data.sim.effect_size_varying.outcomes = data.sim.effect_size_varying.pvals |>
  group_by(m, p0, alpha, effect_size, trial) |>
  mutate(
    p_BH = p.adjust(p_uncorrected, "BH"),
    p_bonferroni = p.adjust(p_uncorrected, "bonferroni"),
    p_reject_none = 1
  ) |>
  pivot_longer(cols = starts_with("p_"), names_prefix = "p_", names_to = "method", values_to = "p_val") |>
  group_by(method, m, p0, alpha, effect_size, trial) |>
  mutate(
    reject_null = p_val < alpha,
    TP = as.integer(`mu > 0` & reject_null), # correctness of the decision
    FP = as.integer(!`mu > 0` & reject_null),
    TN = as.integer(!`mu > 0` & !reject_null),
    FN = as.integer(`mu > 0` & !reject_null)
  ) |>
  summarise_at(.vars = vars(TP, FP, TN, FN), sum)
saveRDS(data.sim.effect_size_varying.outcomes, "../data/simulations/data.sim.varying_outcomes.rds")
We then re-create the plot, visualising mean point estimates and intervals for each strategy:
data.sim.effect_size_varying.outcomes = readRDS("../data/simulations/data.sim.varying_outcomes.rds") |>
  mutate(method = factor(method, levels = c("bonferroni", "BH", "uncorrected", "reject_none")))

p.effect_size = data.sim.effect_size_varying.outcomes |>
  mutate(FDR = ifelse(TP + FP > 0, FP / (TP + FP), 0)) |>
  ggplot(aes(x = effect_size, y = FDR, colour = method)) +
  stat_pointinterval(point_interval = "mean_qi", .width = 0.8, position = position_dodge(width = 0.02)) +
  facet_grid(. ~ p0, scales = "free_y") +
  scale_x_continuous(breaks = seq(0.1, 0.5, by = 0.1)) +
  labs(x = "effect size (standardised)") +
  theme(panel.spacing = unit(5, "mm"), legend.position = "none")

plot_grid(p.effect_size, legend, rel_widths = c(9, 1), nrow = 1)
Since effect size impacts FDR, we try to derive a functional form for the false discovery rate as a function of effect size:
\[ E(FP \mid \pi_0, m) = \alpha \pi_0 m \\ E(FDR) = E\left(\frac{FP}{FP + TP}\right) \geq \frac{\alpha \pi_0 m}{\alpha \pi_0 m + (1 - \pi_0)m} = \frac{\alpha \pi_0}{\alpha \pi_0 + (1 - \pi_0)} \] In the above set of equations, we identify a lower bound on the false discovery rate for the Benjamini-Hochberg strategy. Assuming there are no false negatives implies that all the positives were correctly identified (usually the case only for large effect sizes), so the number of true positives takes its maximum value \((1 - \pi_0)m\). Since \(FP/(FP + TP)\) can only decrease as TP grows, this gives us a lower bound. We can visualise it below; the black line indicates the theoretical estimate, while the red line indicates the expectation of the False Discovery Rate:
data.sim.effect_size.bounds = crossing(m, p0 = seq(0.1, 0.9, by = 0.1), alpha, mu = 2:10, sd, trial = 1:1e3) |>
  mutate(
    `mu > 0` = map2(m, p0, ~ sample(c(rep(0, round(.y * .x)), rep(1, round((1 - .y) * .x))), .x)),
    effect_size = mu / sd
  ) |>
  unnest(c(`mu > 0`)) |>
  mutate(p_val = pmap_dbl(list(`mu > 0`, m, mu, sd), ~ ifelse(..1 == 0, runif(1, 0, 1), rp(..3, ..4, ..2)))) |>
  group_by(m, p0, alpha, effect_size, trial) |>
  mutate(
    reject_null = p_val < alpha,
    TP = as.integer(`mu > 0` & reject_null), # correctness of the decision
    FP = as.integer(!`mu > 0` & reject_null),
    TN = as.integer(!`mu > 0` & !reject_null),
    FN = as.integer(`mu > 0` & !reject_null)
  ) |>
  summarise_at(.vars = vars(TP, FP, TN, FN), sum) |>
  mutate(FDR = ifelse(TP + FP > 0, FP / (TP + FP), 0))

data.sim.effect_size.bounds |>
  saveRDS("../data/simulations/data.sim.effect_size.bounds.rds")
data.sim.effect_size.bounds = readRDS("../data/simulations/data.sim.effect_size.bounds.rds")
data.sim.effect_size.bounds |>
  group_by(p0, effect_size) |>
  mean_qi(FDR) |>
  ggplot() +
  geom_line(aes(p0, FDR), colour = "red") +
  # geom_pointinterval(aes(p0, FDR, ymin = .lower, ymax = .upper), colour = "red") +
  geom_line(aes(p0, p0 * alpha / (alpha * p0 + 1 - p0))) +
  # geom_point(aes(p0, p0 * alpha / (alpha * p0 + 1 - p0))) +
  facet_wrap(~ effect_size)
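The bound can also be evaluated pointwise; a small self-contained helper (the function name is mine, not part of the pipeline above):

```r
# Lower bound on E(FDR) at level alpha, derived above under the
# no-false-negatives assumption: alpha*pi0 / (alpha*pi0 + (1 - pi0))
fdr_bound <- function(pi0, alpha) alpha * pi0 / (alpha * pi0 + (1 - pi0))
fdr_bound(c(0.5, 0.7, 0.9), 0.25)
```

Note that the bound grows steeply in \(\pi_0\): at \(\alpha = 0.25\) it rises from 0.2 at \(\pi_0 = 0.5\) to roughly 0.69 at \(\pi_0 = 0.9\).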
The final variable that we can manipulate, and which will impact participants' behavior, is the incentive matrix, which associates a specific reward with each of a True Positive, True Negative, False Positive, and False Negative.
Below, we show the payoff from a commonly used incentive matrix for TP/TN/FN/FP: 1, 1, -1, -(\(\frac{1}{\alpha} - 1\)), with a baseline pay of 5 and the reward being scaled by 10:
p.effect_payoff = data.sim.effect_size_varying.outcomes |>
  mutate(payoff = TP - 3 * FP + TN - FN) |>
  ggplot(aes(x = effect_size, y = payoff, colour = method)) +
  stat_pointinterval(point_interval = "mean_qi", .width = 0.8, position = position_dodge(width = 0.04)) +
  facet_grid(. ~ p0, scales = "free_y") +
  labs(x = "effect size (standardised)") +
  scale_x_continuous(breaks = seq(0.1, 0.5, by = 0.1)) +
  theme(panel.spacing = unit(5, "mm"), legend.position = "none")

plot_grid(p.effect_payoff, legend, rel_widths = c(9, 1), nrow = 1)
There appears to be very little distinction between all the correction procedures. Let us look at the FDR, as well as at individual TP/TN/FP/FN rates to better understand what’s going on:
p1 = data.sim.effect_size_varying.outcomes |>
  pivot_longer(c(TP:FN), names_to = "outcome", values_to = "frequency") |>
  group_by(method, m, p0, alpha, effect_size, outcome) |>
  mean_qi(frequency, .width = 0.8) |>
  ggplot() +
  geom_pointinterval(aes(x = effect_size, y = frequency, ymin = .lower, ymax = .upper, colour = method),
                     position = position_dodge(width = 0.05)) +
  facet_grid(outcome ~ p0, scales = "free_y") +
  labs(x = "effect size (standardised)") +
  theme(panel.spacing = unit(5, "mm"), legend.position = "none")

p2 = data.sim.effect_size_varying.outcomes |>
  group_by(method, m, p0, alpha, effect_size, trial) |>
  summarise_at(.vars = vars(TP, FP, TN, FN), sum) |>
  mutate(FDR = ifelse((FP + TP) > 0, FP / (FP + TP), 0)) |>
  mean_qi(FDR, .width = 0.8) |>
  ggplot() +
  geom_pointinterval(aes(x = effect_size, y = FDR, ymin = .lower, ymax = .upper, colour = method),
                     position = position_dodge(width = 0.05)) +
  facet_grid(. ~ p0, scales = "free_y") +
  labs(x = "effect size (standardised)") +
  theme(panel.spacing = unit(5, "mm"), legend.position = "none")

plot_grid(
  plot_grid(p1, p2, ncol = 1, rel_heights = c(5, 2)),
  legend,
  nrow = 1,
  rel_widths = c(10, 1)
)
First, we should ignore the scenarios with extremely small effect sizes, as the signal is not strong enough to be detected empirically. For small-to-medium effect sizes (i.e., \(d \geq 0.3\)), we observe that a multiple comparisons procedure such as BH results in very few FP (as does Bonferroni) at the expense of some TP (compared to the uncorrected strategy); on the other hand, it also results in fewer FN (compared to Bonferroni) and more TN (compared to the uncorrected strategy).
Let us look at some other formulations of the incentive structure. In traditional NHST, false positive rates are controlled at \(\alpha = 0.05\), which implies that a False Positive is 19 times as costly as a True Positive. As Benjamini-Hochberg is a False Discovery Rate correction procedure, it makes sense to penalise FP more strongly. This leads to an incentive matrix of 1/1/-1/-19 (TP/TN/FN/FP):
(data.sim.effect_size_varying.outcomes |>
  mutate(payoff = TP + TN - FN - 19 * FP) |>
  ggplot(aes(x = effect_size, y = payoff, colour = method)) +
  stat_pointinterval(point_interval = "mean_qi", .width = 0.8, position = position_dodge(width = 0.05)) +
  facet_grid(. ~ p0, scales = "free_y") +
  labs(x = "effect size (standardised)") +
  scale_x_continuous(breaks = seq(0.1, 0.5, by = 0.1)) +
  theme(panel.spacing = unit(5, "mm"), legend.position = "none")) |>
  plot_grid(legend, rel_widths = c(10, 1))
However, the relationships between a TP and a TN, and between a TP and an FN, are not clear. Below, we explore an alternative formulation which strongly rewards a TP (a successful discovery), weakly rewards a TN (identifying what does not work), strongly penalises an FN (the potential for missing an important discovery), and extremely strongly penalises an FP (the many costs associated with falsely claiming a discovery). We visualise the payoff based on such an incentive structure, 5/1/-4/-15 (TP/TN/FN/FP):
(data.sim.effect_size_varying.outcomes |>
  mutate(payoff = 5 * TP + 1 * TN - 4 * FN - 15 * FP) |>
  ggplot(aes(x = effect_size, y = payoff, colour = method)) +
  stat_pointinterval(point_interval = "mean_qi", .width = 0.8, position = position_dodge(width = 0.05)) +
  facet_grid(. ~ p0, scales = "free_y") +
  labs(x = "effect size (standardised)") +
  scale_x_continuous(breaks = seq(0.1, 0.5, by = 0.1)) +
  scale_y_continuous(breaks = seq(-100, 50, by = 20)) +
  theme(panel.spacing = unit(5, "mm"), legend.position = "none")) |>
  plot_grid(legend, rel_widths = c(10, 1))
The incentive structure used above results in sufficient differences, on average, between the optimal BH and uncorrected strategies, particularly when \(\delta = 0.4\) and \(Pr(H_0) = 0.7\). It also ensures that a multiple comparisons correction strategy is superior to a uniform reject-all criterion. From the plot above, we see that the average difference between the BH and uncorrected strategies is approximately 25 payoff units. When selecting the stimuli, we can use rejection sampling to ensure that the stimuli we select show at least this amount of difference; we will then randomly sample the required number of trials from the set which meets this criterion.
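The rejection-sampling step could look like the following sketch; the `gap` column and the 25-unit threshold are placeholders standing in for the simulated payoff differences described above:

```r
set.seed(42)
# placeholder: one row per simulated trial, with its BH-vs-uncorrected payoff gap
candidate <- data.frame(trial = 1:1000, gap = rnorm(1000, mean = 25, sd = 10))
# rejection step: keep only trials whose gap meets the threshold
accepted <- subset(candidate, gap >= 25)
# then randomly sample the number of trials the experiment needs
picked <- accepted[sample(nrow(accepted), 5), ]
nrow(accepted) / nrow(candidate)  # acceptance rate
```

In the actual pipeline, `gap` would be computed from the simulated confusion matrices rather than drawn from a normal distribution.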